107 research outputs found

    Base calling for high-throughput short-read sequencing: dynamic programming solutions

    Get PDF
    Shreepriya Das and Haris Vikalo are with the Electrical and Computer Engineering Department, The University of Texas at Austin, Austin, Texas 78712, USABackground: Next-generation DNA sequencing platforms are capable of generating millions of reads in a matter of days at rapidly reducing costs. Despite its proliferation and technological improvements, the performance of next-generation sequencing remains adversely affected by the imperfections in the underlying biochemical and signal acquisition procedures. To this end, various techniques, including statistical methods, are used to improve read lengths and accuracy of these systems. Development of high performing base calling algorithms that are computationally efficient and scalable is an ongoing challenge. Results: We develop model-based statistical methods for fast and accurate base calling in Illumina’s next-generation sequencing platforms. In particular, we propose a computationally tractable parametric model which enables dynamic programming formulation of the base calling problem. Forward-backward and soft-output Viterbi algorithms are developed, and their performance and complexity are investigated and compared with the existing state-of-the-art base calling methods for this platform. A C code implementation of our algorithm named Softy can be downloaded from https://sourceforge.net/projects/dynamicprog webcite. Conclusions: We demonstrate high accuracy and speed of the proposed methods on reads obtained using Illumina’s Genome Analyzer II and HiSeq2000. In addition to performing reliable and fast base calling, the developed algorithms enable incorporation of prior knowledge which can be utilized for parameter estimation and is potentially beneficial in various downstream applications.Electrical and Computer [email protected]

    On the sphere-decoding algorithm I. Expected complexity

    Get PDF
    The problem of finding the least-squares solution to a system of linear equations where the unknown vector is comprised of integers, but the matrix coefficient and given vector are comprised of real numbers, arises in many applications: communications, cryptography, GPS, to name a few. The problem is equivalent to finding the closest lattice point to a given point and is known to be NP-hard. In communications applications, however, the given vector is not arbitrary but rather is an unknown lattice point that has been perturbed by an additive noise vector whose statistical properties are known. Therefore, in this paper, rather than dwell on the worst-case complexity of the integer least-squares problem, we study its expected complexity, averaged over the noise and over the lattice. For the "sphere decoding" algorithm of Fincke and Pohst, we find a closed-form expression for the expected complexity, both for the infinite and finite lattice. It is demonstrated in the second part of this paper that, for a wide range of signal-to-noise ratios (SNRs) and numbers of antennas, the expected complexity is polynomial, in fact, often roughly cubic. Since many communications systems operate at noise levels for which the expected complexity turns out to be polynomial, this suggests that maximum-likelihood decoding, which was hitherto thought to be computationally intractable, can, in fact, be implemented in real time - a result with many practical implications

    On the sphere-decoding algorithm II. Generalizations, second-order statistics, and applications to communications

    Get PDF
    In Part 1, we found a closed-form expression for the expected complexity of the sphere-decoding algorithm, both for the infinite and finite lattice. We continue the discussion in this paper by generalizing the results to the complex version of the problem and using the expected complexity expressions to determine situations where sphere decoding is practically feasible. In particular, we consider applications of sphere decoding to detection in multiantenna systems. We show that, for a wide range of signal-to-noise ratios (SNRs), rates, and numbers of antennas, the expected complexity is polynomial, in fact, often roughly cubic. Since many communications systems operate at noise levels for which the expected complexity turns out to be polynomial, this suggests that maximum-likelihood decoding, which was hitherto thought to be computationally intractable, can, in fact, be implemented in real-time-a result with many practical implications. To provide complexity information beyond the mean, we derive a closed-form expression for the variance of the complexity of sphere-decoding algorithm in a finite lattice. Furthermore, we consider the expected complexity of sphere decoding for channels with memory, where the lattice-generating matrix has a special Toeplitz structure. Results indicate that the expected complexity in this case is, too, polynomial over a wide range of SNRs, rates, data blocks, and channel impulse response lengths

    Maximum-Likelihood Sequence Detection of Multiple Antenna Systems over Dispersive Channels via Sphere Decoding

    Get PDF
    Multiple antenna systems are capable of providing high data rate transmissions over wireless channels. When the channels are dispersive, the signal at each receive antenna is a combination of both the current and past symbols sent from all transmit antennas corrupted by noise. The optimal receiver is a maximum-likelihood sequence detector and is often considered to be practically infeasible due to high computational complexity (exponential in number of antennas and channel memory). Therefore, in practice, one often settles for a less complex suboptimal receiver structure, typically with an equalizer meant to suppress both the intersymbol and interuser interference, followed by the decoder. We propose a sphere decoding for the sequence detection in multiple antenna communication systems over dispersive channels. The sphere decoding provides the maximum-likelihood estimate with computational complexity comparable to the standard space-time decision-feedback equalizing (DFE) algorithms. The performance and complexity of the sphere decoding are compared with the DFE algorithm by means of simulations

    On joint detection and decoding of linear block codes on Gaussian vector channels

    Get PDF
    Optimal receivers recovering signals transmitted across noisy communication channels employ a maximum-likelihood (ML) criterion to minimize the probability of error. The problem of finding the most likely transmitted symbol is often equivalent to finding the closest lattice point to a given point and is known to be NP-hard. In systems that employ error-correcting coding for data protection, the symbol space forms a sparse lattice, where the sparsity structure is determined by the code. In such systems, ML data recovery may be geometrically interpreted as a search for the closest point in the sparse lattice. In this paper, motivated by the idea of the "sphere decoding" algorithm of Fincke and Pohst, we propose an algorithm that finds the closest point in the sparse lattice to the given vector. This given vector is not arbitrary, but rather is an unknown sparse lattice point that has been perturbed by an additive noise vector whose statistical properties are known. The complexity of the proposed algorithm is thus a random variable. We study its expected value, averaged over the noise and over the lattice. For binary linear block codes, we find the expected complexity in closed form. Simulation results indicate significant performance gains over systems employing separate detection and decoding, yet are obtained at a complexity that is practically feasible over a wide range of system parameters

    Joint Haplotype Assembly and Genotype Calling via Sequential Monte Carlo Algorithm

    Get PDF
    Genetic variations predispose individuals to hereditary diseases, play important role in the development of complex diseases, and impact drug metabolism. The full information about the DNA variations in the genome of an individual is given by haplotypes, the ordered lists of single nucleotide polymorphisms (SNPs) located on chromosomes. Affordable high-throughput DNA sequencing technologies enable routine acquisition of data needed for the assembly of single individual haplotypes. However, state-of-the-art high-throughput sequencing platforms generate data that is erroneous, which induces uncertainty in the SNP and genotype calling procedures and, ultimately, adversely affect the accuracy of haplotyping. When inferring haplotype phase information, the vast majority of the existing techniques for haplotype assembly assume that the genotype information is correct. This motivates the development of methods capable of joint genotype calling and haplotype assembly. Results: We present a haplotype assembly algorithm, ParticleHap, that relies on a probabilistic description of the sequencing data to jointly infer genotypes and assemble the most likely haplotypes. Our method employs a deterministic sequential Monte Carlo algorithm that associates single nucleotide polymorphisms with haplotypes by exhaustively exploring all possible extensions of the partial haplotypes. The algorithm relies on genotype likelihoods rather than on often erroneously called genotypes, thus ensuring a more accurate assembly of the haplotypes. Results on both the 1000 Genomes Project experimental data as well as simulation studies demonstrate that the proposed approach enables highly accurate solutions to the haplotype assembly problem while being computationally efficient and scalable, generally outperforming existing methods in terms of both accuracy and speed. Conclusions: The developed probabilistic framework and sequential Monte Carlo algorithm enable joint haplotype assembly and genotyping in a computationally efficient manner. Our results demonstrate fast and highly accurate haplotype assembly aided by the re-examination of erroneously called genotypes.National Science Foundation CCF-1320273Electrical and Computer Engineerin

    On noise processes and limits of performance in biosensors

    Get PDF
    In this paper, we present a comprehensive stochastic model describing the measurement uncertainty, output signal, and limits of detection of affinity-based biosensors. The biochemical events within the biosensor platform are modeled by a Markov stochastic process, describing both the probabilistic mass transfer and the interactions of analytes with the capturing probes. To generalize this model and incorporate the detection process, we add noisy signal transduction and amplification stages to the Markov model. Using this approach, we are able to evaluate not only the output signal and the statistics of its fluctuation but also the noise contributions of each stage within the biosensor platform. Furthermore, we apply our formulations to define the signal-to-noise ratio, noise figure, and detection dynamic range of affinity-based biosensors. Motivated by the platforms encountered in practice, we construct the noise model of a number of widely used systems. The results of this study show that our formulations predict the behavioral characteristics of affinity-based biosensors which indicate the validity of the model

    Haplotype Assembly: An Information Theoretic View

    Full text link
    This paper studies the haplotype assembly problem from an information theoretic perspective. A haplotype is a sequence of nucleotide bases on a chromosome, often conveniently represented by a binary string, that differ from the bases in the corresponding positions on the other chromosome in a homologous pair. Information about the order of bases in a genome is readily inferred using short reads provided by high-throughput DNA sequencing technologies. In this paper, the recovery of the target pair of haplotype sequences using short reads is rephrased as a joint source-channel coding problem. Two messages, representing haplotypes and chromosome memberships of reads, are encoded and transmitted over a channel with erasures and errors, where the channel model reflects salient features of high-throughput sequencing. The focus of this paper is on the required number of reads for reliable haplotype reconstruction, and both the necessary and sufficient conditions are presented with order-wise optimal bounds.Comment: 30 pages, 5 figures, 1 tabel, journa
    corecore